24  Lab: Hate crimes

In session 7 (week 8) we discussed data and society: the academic and practitioner discourse on the social, political and ethical aspects of data science. We asked how one can responsibly carry out data science research on social phenomena, which ethical and social frameworks can help us critically approach data science practices and their effects on society, and what ethical practice looks like for data scientists.

24.1 Datasets

24.1.1 Further datasets

24.1.2 Additional Readings

  • Indicators - critical reviews: The Poverty of Statistics and the Statistics of Poverty: https://www.tandfonline.com/doi/full/10.1080/01436590903321844?src=recsys
  • Indicators in global health: the argument is that indicators are usually comprehensible only to a small group of experts. Why use indicators then? "Because indicators used in global HIV finance offer openings for engagement to promote accountability (…) some indicators and data truly are better than others, and as they were all created by humans, they all can be deconstructed and remade in other forms" Davis, S. (2020). The Uncounted: Politics of Data in Global Health, Cambridge. doi:10.1017/9781108649544

Indicators - conceptualization

24.2 Hate Crimes

24.2.1 Source:

https://github.com/fivethirtyeight/data/tree/master/hate-crimes

24.2.2 Variables:

Header                                    Definition
state                                     State name (this column is called NAME in the Excel file loaded below)
median_household_income                   Median household income, 2016
share_unemployed_seasonal                 Share of the population that is unemployed (seasonally adjusted), Sept. 2016
share_population_in_metro_areas           Share of the population that lives in metropolitan areas, 2015
share_population_with_high_school_degree  Share of adults 25 and older with a high-school degree, 2009
share_non_citizen                         Share of the population that are not U.S. citizens, 2015
share_white_poverty                       Share of white residents who are living in poverty, 2015
gini_index                                Gini Index, 2015
share_non_white                           Share of the population that is not white, 2015
share_voters_voted_trump                  Share of 2016 U.S. presidential voters who voted for Donald Trump
hate_crimes_per_100k_splc                 Hate crimes per 100,000 population, Southern Poverty Law Center, Nov. 9-18, 2016
avg_hatecrimes_per_100k_fbi               Average annual hate crimes per 100,000 population, FBI, 2010-2015

24.3 Data exploration

import pandas as pd
df = pd.read_excel('data/hate_Crimes_v2.xlsx')

A reminder: anything with a pd. prefix comes from pandas. This is particularly useful for preventing a module from overwriting built-in Python functionality.
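As a small illustration of why the prefix matters, the sketch below uses NumPy instead of pandas, because NumPy ships a function named `max` that behaves differently from Python's built-in `max`. The list `values` is made up for the example.

```python
import numpy as np

values = [[1, 2], [3, 4]]

# Built-in max compares the two inner lists as whole objects
builtin_result = max(values)

# NumPy's max, reached through the np. prefix, looks at every number
numpy_result = np.max(values)

print(builtin_result)  # the larger inner list
print(numpy_result)    # the largest single number
```

Because the NumPy function is only reachable as `np.max`, the built-in `max` keeps working as usual.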

Let’s have a look at our dataset.

df.tail()
NAME median_household_income share_unemployed_seasonal share_population_in_metro_areas share_population_with_high_school_degree share_non_citizen share_white_poverty gini_index share_non_white share_voters_voted_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
46 Virginia 66155 0.043 0.89 0.866 0.06 0.07 0.459 0.38 0.45 0.36 1.72
47 Washington 59068 0.052 0.86 0.897 0.08 0.09 0.441 0.31 0.38 0.67 3.81
48 West Virginia 39552 0.073 0.55 0.828 0.01 0.14 0.451 0.07 0.69 0.32 2.03
49 Wisconsin 58080 0.043 0.69 0.898 0.03 0.09 0.430 0.22 0.48 0.22 1.12
50 Wyoming 55690 0.040 0.31 0.918 0.02 0.09 0.423 0.15 0.70 0.00 0.26
type(df)
pandas.core.frame.DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 12 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   NAME                                      51 non-null     object 
 1   median_household_income                   51 non-null     int64  
 2   share_unemployed_seasonal                 51 non-null     float64
 3   share_population_in_metro_areas           51 non-null     float64
 4   share_population_with_high_school_degree  51 non-null     float64
 5   share_non_citizen                         48 non-null     float64
 6   share_white_poverty                       51 non-null     float64
 7   gini_index                                51 non-null     float64
 8   share_non_white                           51 non-null     float64
 9   share_voters_voted_trump                  51 non-null     float64
 10  hate_crimes_per_100k_splc                 51 non-null     float64
 11  avg_hatecrimes_per_100k_fbi               51 non-null     float64
dtypes: float64(10), int64(1), object(1)
memory usage: 4.9+ KB

24.3.1 Missing values

Let’s explore the dataset

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 12 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   NAME                                      51 non-null     object 
 1   median_household_income                   51 non-null     int64  
 2   share_unemployed_seasonal                 51 non-null     float64
 3   share_population_in_metro_areas           51 non-null     float64
 4   share_population_with_high_school_degree  51 non-null     float64
 5   share_non_citizen                         48 non-null     float64
 6   share_white_poverty                       51 non-null     float64
 7   gini_index                                51 non-null     float64
 8   share_non_white                           51 non-null     float64
 9   share_voters_voted_trump                  51 non-null     float64
 10  hate_crimes_per_100k_splc                 51 non-null     float64
 11  avg_hatecrimes_per_100k_fbi               51 non-null     float64
dtypes: float64(10), int64(1), object(1)
memory usage: 4.9+ KB

The table above shows that data are missing for some states: share_non_citizen has only 48 non-null values out of 51 entries. The per-column counts of missing values below confirm this.

df.isna().sum()
NAME                                        0
median_household_income                     0
share_unemployed_seasonal                   0
share_population_in_metro_areas             0
share_population_with_high_school_degree    0
share_non_citizen                           3
share_white_poverty                         0
gini_index                                  0
share_non_white                             0
share_voters_voted_trump                    0
hate_crimes_per_100k_splc                   0
avg_hatecrimes_per_100k_fbi                 0
dtype: int64
import numpy as np
np.unique(df.NAME)
array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

There aren’t any unexpected values in the ‘NAME’ column, which holds the state names.
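The counts above say how many values are missing, but not which states they belong to. The following is a minimal sketch of one way to list the affected rows. It uses a toy DataFrame with invented names and numbers rather than the lab file, so `toy`, `has_missing` and `missing_names` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab's DataFrame; all values are invented
toy = pd.DataFrame({
    "NAME": ["State A", "State B", "State C", "State D"],
    "share_non_citizen": [0.06, np.nan, 0.03, np.nan],
    "gini_index": [0.45, 0.44, 0.46, 0.43],
})

# True for every row that has at least one missing value in any column
has_missing = toy.isna().any(axis=1)

# Names of the rows that contain missing values
missing_names = toy.loc[has_missing, "NAME"].tolist()
print(missing_names)

# Missing values per column, the same call as df.isna().sum() above
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the lab data the equivalent call would be `df[df.isna().any(axis=1)]`.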

24.4 Mapping hate crime across the USA

#using James' code from the last lab: we need  the geospatial polygons of the states in America  
import geopandas as gpd 
import pandas as pd
import altair as alt

geo_states = gpd.read_file('data/gz_2010_us_040_00_500k.json')
#df = pd.read_excel('data/hate_Crimes_v2.xlsx')
geo_states.head()
GEO_ID STATE NAME LSAD CENSUSAREA geometry
0 0400000US23 23 Maine 30842.923 MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ...
1 0400000US25 25 Massachusetts 7800.058 MULTIPOLYGON (((-70.83204 41.60650, -70.82373 ...
2 0400000US26 26 Michigan 56538.901 MULTIPOLYGON (((-88.68443 48.11579, -88.67563 ...
3 0400000US30 30 Montana 145545.801 POLYGON ((-104.05770 44.99743, -104.25015 44.9...
4 0400000US32 32 Nevada 109781.180 POLYGON ((-114.05060 37.00040, -114.04999 36.9...
alt.Chart(geo_states, title='US states').mark_geoshape().encode(
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
# Add the data
# Both tables have a 'NAME' column holding the state name, so we can merge on it directly
geo_states = geo_states.merge(df, on='NAME')
geo_states.head()
GEO_ID STATE NAME LSAD CENSUSAREA geometry median_household_income share_unemployed_seasonal share_population_in_metro_areas share_population_with_high_school_degree share_non_citizen share_white_poverty gini_index share_non_white share_voters_voted_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
0 0400000US23 23 Maine 30842.923 MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ... 51710 0.044 0.54 0.902 NaN 0.12 0.437 0.09 0.45 0.61 2.62
1 0400000US25 25 Massachusetts 7800.058 MULTIPOLYGON (((-70.83204 41.60650, -70.82373 ... 63151 0.046 0.97 0.890 0.09 0.08 0.475 0.27 0.34 0.63 4.80
2 0400000US26 26 Michigan 56538.901 MULTIPOLYGON (((-88.68443 48.11579, -88.67563 ... 52005 0.050 0.87 0.879 0.04 0.09 0.451 0.24 0.48 0.40 3.20
3 0400000US30 30 Montana 145545.801 POLYGON ((-104.05770 44.99743, -104.25015 44.9... 51102 0.041 0.34 0.908 0.01 0.10 0.435 0.10 0.57 0.49 2.95
4 0400000US32 32 Nevada 109781.180 POLYGON ((-114.05060 37.00040, -114.04999 36.9... 49875 0.067 0.87 0.839 0.10 0.08 0.448 0.50 0.46 0.14 2.11
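`merge` performs an inner join by default, so any shape whose name has no match in the hate crime table is silently dropped from the result. The sketch below shows one way to check for such rows with `how="left"` and `indicator=True`. It uses small plain pandas tables instead of the GeoDataFrame; the two rates are copied from the output above, and "Puerto Rico" is a hypothetical unmatched name added for illustration.

```python
import pandas as pd

# Toy stand-ins for geo_states and df
shapes = pd.DataFrame({"NAME": ["Maine", "Nevada", "Puerto Rico"]})
crimes = pd.DataFrame({
    "NAME": ["Maine", "Nevada"],
    "avg_hatecrimes_per_100k_fbi": [2.62, 2.11],
})

# how="left" keeps every shape; indicator=True adds a _merge column
# recording whether each row was found in both tables or only the left one
checked = shapes.merge(crimes, on="NAME", how="left", indicator=True)

# Shapes that found no partner in the hate crime table
unmatched = checked.loc[checked["_merge"] == "left_only", "NAME"].tolist()
print(unmatched)
```

An empty list would mean every shape found its data; a non-empty list points to spelling differences or to territories that the hate crime table does not cover.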
alt.Chart(geo_states, title='PRE-election Hate crime per 100k').mark_geoshape().encode(
    color='avg_hatecrimes_per_100k_fbi',
    tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
alt.Chart(geo_states, title='POST-election Hate crime per 100k').mark_geoshape().encode(
    color='hate_crimes_per_100k_splc',
    tooltip=['NAME', 'hate_crimes_per_100k_splc']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

24.4.1 Exploring data

import seaborn as sns
sns.pairplot(data = df.iloc[:,1:])

df.boxplot(column=['median_household_income'])
<Axes: >

df.boxplot(column=['avg_hatecrimes_per_100k_fbi'])
<Axes: >

We may want to drop rows (remove them) from the DataFrame; the pandas documentation for DataFrame.drop has the details.

Let us drop Hawaii.

df[df.NAME == 'Hawaii']
NAME median_household_income share_unemployed_seasonal share_population_in_metro_areas share_population_with_high_school_degree share_non_citizen share_white_poverty gini_index share_non_white share_voters_voted_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
11 Hawaii 71223 0.034 0.76 0.904 0.08 0.07 0.433 0.81 0.3 0.0 0.0
df = df.drop(df.index[11])
df.describe()
median_household_income share_unemployed_seasonal share_population_in_metro_areas share_population_with_high_school_degree share_non_citizen share_white_poverty gini_index share_non_white share_voters_voted_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
count 50.000000 50.000000 50.000000 50.000000 47.000000 50.000000 50.000000 50.000000 50.00000 50.000000 50.000000
mean 54903.620000 0.049880 0.750000 0.868420 0.054043 0.092200 0.454180 0.305800 0.49380 0.281200 2.363200
std 9010.994814 0.010571 0.183425 0.034049 0.031184 0.024767 0.020889 0.150551 0.11674 0.255779 1.714502
min 35521.000000 0.028000 0.310000 0.799000 0.010000 0.040000 0.419000 0.060000 0.04000 0.000000 0.260000
25% 48358.500000 0.042250 0.630000 0.839750 0.030000 0.080000 0.440000 0.192500 0.42000 0.130000 1.290000
50% 54613.000000 0.051000 0.790000 0.874000 0.040000 0.090000 0.454500 0.275000 0.49500 0.215000 1.980000
75% 60652.750000 0.057750 0.897500 0.897750 0.080000 0.100000 0.466750 0.420000 0.57750 0.345000 3.182500
max 76165.000000 0.073000 1.000000 0.918000 0.130000 0.170000 0.532000 0.630000 0.70000 1.520000 10.950000
df.plot(x = 'avg_hatecrimes_per_100k_fbi', y = 'median_household_income', kind='scatter')
<Axes: xlabel='avg_hatecrimes_per_100k_fbi', ylabel='median_household_income'>

df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')
<Axes: xlabel='hate_crimes_per_100k_splc', ylabel='median_household_income'>

df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
NAME median_household_income share_unemployed_seasonal share_population_in_metro_areas share_population_with_high_school_degree share_non_citizen share_white_poverty gini_index share_non_white share_voters_voted_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
8 District of Columbia 68277 0.067 1.00 0.871 0.11 0.04 0.532 0.63 0.04 1.52 10.95
37 Oregon 58875 0.062 0.87 0.891 0.07 0.10 0.449 0.26 0.41 0.83 3.39
47 Washington 59068 0.052 0.86 0.897 0.08 0.09 0.441 0.31 0.38 0.67 3.81
import matplotlib.pyplot as plt
outliers_df = df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')

plt.scatter(outliers_df.hate_crimes_per_100k_splc, outliers_df.median_household_income ,c='red')
<matplotlib.collections.PathCollection at 0x169d83e50>
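The filter used above flags states whose SPLC rate is larger than 2.5 times the standard deviation, measured from zero. A common alternative, shown here as a sketch and not as the lab's method, measures the distance from the mean instead: a value is flagged when it exceeds mean + 2.5 × standard deviation. The rates below are typed in from the per-state table printed in this lab (Hawaii excluded), so they are rounded to two decimals.

```python
import pandas as pd

# SPLC hate crimes per 100k, copied from the lab's per-state table
splc = pd.Series({
    "Alabama": 0.12, "Alaska": 0.14, "Arizona": 0.22, "Arkansas": 0.06,
    "California": 0.25, "Colorado": 0.39, "Connecticut": 0.33,
    "Delaware": 0.32, "District of Columbia": 1.52, "Florida": 0.18,
    "Georgia": 0.12, "Idaho": 0.12, "Illinois": 0.19, "Indiana": 0.24,
    "Iowa": 0.45, "Kansas": 0.10, "Kentucky": 0.32, "Louisiana": 0.10,
    "Maine": 0.61, "Maryland": 0.37, "Massachusetts": 0.63,
    "Michigan": 0.40, "Minnesota": 0.62, "Mississippi": 0.06,
    "Missouri": 0.18, "Montana": 0.49, "Nebraska": 0.15, "Nevada": 0.14,
    "New Hampshire": 0.15, "New Jersey": 0.07, "New Mexico": 0.29,
    "New York": 0.35, "North Carolina": 0.24, "North Dakota": 0.00,
    "Ohio": 0.19, "Oklahoma": 0.13, "Oregon": 0.83, "Pennsylvania": 0.28,
    "Rhode Island": 0.09, "South Carolina": 0.20, "South Dakota": 0.00,
    "Tennessee": 0.19, "Texas": 0.21, "Utah": 0.13, "Vermont": 0.32,
    "Virginia": 0.36, "Washington": 0.67, "West Virginia": 0.32,
    "Wisconsin": 0.22, "Wyoming": 0.00,
})

mean = splc.mean()
std = splc.std(ddof=0)  # population standard deviation, the np.std default

# Rule used in the lab: value > 2.5 * std
lab_rule = splc[splc > 2.5 * std]

# Alternative rule: value > mean + 2.5 * std (a z-score above 2.5)
z_rule = splc[splc > mean + 2.5 * std]

print(lab_rule.index.tolist())
print(z_rule.index.tolist())
```

The alternative rule is stricter because the threshold is shifted up by the mean, so fewer states are flagged. Which rule is appropriate depends on what counts as unusual for the question at hand.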

df_pivot = df.pivot_table(index=['NAME'], values=['hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi', 'median_household_income'])
df_pivot

##sort by values
#df_pivot = pd.pivot_table(df, index=['state'], columns = ['hate_crimes_per_100k_splc'], fill_value=0)
#df_pivot
#df2 = df_pivot.reindex(df_pivot['hate_crimes_per_100k_splc'].sort_values(by='hate_crimes_per_100k_splc', ascending=False).index)
avg_hatecrimes_per_100k_fbi hate_crimes_per_100k_splc median_household_income
NAME
Alabama 1.80 0.12 42278
Alaska 1.65 0.14 67629
Arizona 3.41 0.22 49254
Arkansas 0.86 0.06 44922
California 2.39 0.25 60487
Colorado 2.80 0.39 60940
Connecticut 3.77 0.33 70161
Delaware 1.46 0.32 57522
District of Columbia 10.95 1.52 68277
Florida 0.69 0.18 46140
Georgia 0.41 0.12 49555
Idaho 1.89 0.12 53438
Illinois 1.04 0.19 54916
Indiana 1.75 0.24 48060
Iowa 0.56 0.45 57810
Kansas 2.14 0.10 53444
Kentucky 4.20 0.32 42786
Louisiana 1.34 0.10 42406
Maine 2.62 0.61 51710
Maryland 1.32 0.37 76165
Massachusetts 4.80 0.63 63151
Michigan 3.20 0.40 52005
Minnesota 3.61 0.62 67244
Mississippi 0.62 0.06 35521
Missouri 1.90 0.18 56630
Montana 2.95 0.49 51102
Nebraska 2.68 0.15 56870
Nevada 2.11 0.14 49875
New Hampshire 2.10 0.15 73397
New Jersey 4.41 0.07 65243
New Mexico 1.88 0.29 46686
New York 3.10 0.35 54310
North Carolina 1.26 0.24 46784
North Dakota 4.74 0.00 60730
Ohio 3.24 0.19 49644
Oklahoma 1.08 0.13 47199
Oregon 3.39 0.83 58875
Pennsylvania 0.43 0.28 55173
Rhode Island 1.28 0.09 58633
South Carolina 1.93 0.20 44929
South Dakota 3.30 0.00 53053
Tennessee 3.13 0.19 43716
Texas 0.75 0.21 53875
Utah 2.38 0.13 63383
Vermont 1.90 0.32 60708
Virginia 1.72 0.36 66155
Washington 3.81 0.67 59068
West Virginia 2.03 0.32 39552
Wisconsin 1.12 0.22 58080
Wyoming 0.26 0.00 55690
df_pivot.sort_values(by=['avg_hatecrimes_per_100k_fbi'], ascending=False)
avg_hatecrimes_per_100k_fbi hate_crimes_per_100k_splc median_household_income
NAME
District of Columbia 10.95 1.52 68277
Massachusetts 4.80 0.63 63151
North Dakota 4.74 0.00 60730
New Jersey 4.41 0.07 65243
Kentucky 4.20 0.32 42786
Washington 3.81 0.67 59068
Connecticut 3.77 0.33 70161
Minnesota 3.61 0.62 67244
Arizona 3.41 0.22 49254
Oregon 3.39 0.83 58875
South Dakota 3.30 0.00 53053
Ohio 3.24 0.19 49644
Michigan 3.20 0.40 52005
Tennessee 3.13 0.19 43716
New York 3.10 0.35 54310
Montana 2.95 0.49 51102
Colorado 2.80 0.39 60940
Nebraska 2.68 0.15 56870
Maine 2.62 0.61 51710
California 2.39 0.25 60487
Utah 2.38 0.13 63383
Kansas 2.14 0.10 53444
Nevada 2.11 0.14 49875
New Hampshire 2.10 0.15 73397
West Virginia 2.03 0.32 39552
South Carolina 1.93 0.20 44929
Vermont 1.90 0.32 60708
Missouri 1.90 0.18 56630
Idaho 1.89 0.12 53438
New Mexico 1.88 0.29 46686
Alabama 1.80 0.12 42278
Indiana 1.75 0.24 48060
Virginia 1.72 0.36 66155
Alaska 1.65 0.14 67629
Delaware 1.46 0.32 57522
Louisiana 1.34 0.10 42406
Maryland 1.32 0.37 76165
Rhode Island 1.28 0.09 58633
North Carolina 1.26 0.24 46784
Wisconsin 1.12 0.22 58080
Oklahoma 1.08 0.13 47199
Illinois 1.04 0.19 54916
Arkansas 0.86 0.06 44922
Texas 0.75 0.21 53875
Florida 0.69 0.18 46140
Mississippi 0.62 0.06 35521
Iowa 0.56 0.45 57810
Pennsylvania 0.43 0.28 55173
Georgia 0.41 0.12 49555
Wyoming 0.26 0.00 55690
#This is code for standardization
from sklearn import preprocessing
import numpy as np

#Get column names first
#names = df.columns
#df_stand = df[['median_household_income','share_unemployed_seasonal']]
df_stand = df[['median_household_income','share_unemployed_seasonal', 'share_population_in_metro_areas'
               , 'share_population_with_high_school_degree', 'share_non_citizen', 'share_white_poverty', 'gini_index'
               , 'share_non_white', 'share_voters_voted_trump', 'hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi']]
names = df_stand.columns
#Create the Scaler object
scaler = preprocessing.StandardScaler()
#Fit your data on the scaler object
df2 = scaler.fit_transform(df_stand)
df2 = pd.DataFrame(df2, columns=names)
df2.tail()
median_household_income share_unemployed_seasonal share_population_in_metro_areas share_population_with_high_school_degree share_non_citizen share_white_poverty gini_index share_non_white share_voters_voted_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
45 1.261305 -0.657461 0.771002 -0.071795 0.193108 -0.905436 0.233085 0.497859 -0.379003 0.311206 -0.378961
46 0.466836 0.202590 0.605787 0.847894 0.841399 -0.089728 -0.637357 0.028181 -0.984716 1.535493 0.852428
47 -1.720951 2.209376 -1.101431 -1.199157 -1.427620 1.949543 -0.153778 -1.582146 1.697727 0.153233 -0.196315
48 0.356079 -0.657461 -0.330429 0.877562 -0.779329 -0.089728 -1.169293 -0.575692 -0.119412 -0.241698 -0.732470
49 0.088155 -0.944145 -2.423149 1.470910 -1.103475 -0.089728 -1.507798 -1.045370 1.784258 -1.110547 -1.239166
ax = sns.boxplot(data=df2, orient="h")
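`StandardScaler` rescales each column to mean 0 and standard deviation 1 by computing z = (x - mean) / std, where std is the population standard deviation (ddof=0). `describe()` reports the sample standard deviation (ddof=1), so the two differ by a factor of sqrt((n - 1) / n). The sketch below redoes the calculation by hand for Wyoming's median household income, using the mean and standard deviation printed by `df.describe()` above; the variable names are illustrative.

```python
import math

n = 50                     # rows left after dropping Hawaii
mean_income = 54903.62     # mean from df.describe()
sample_std = 9010.994814   # std from df.describe(), computed with ddof=1

# Convert the sample standard deviation to the population version
population_std = sample_std * math.sqrt((n - 1) / n)

# Wyoming's median household income from the table
wyoming_income = 55690
wyoming_z = (wyoming_income - mean_income) / population_std
print(round(wyoming_z, 6))
```

The result agrees with the last row of `df2.tail()` above (row 49, column median_household_income).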

#Hawaii (row label 11) was already removed from df above, so a copy of df is all we need here
#following https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/
#note: drop() returns a new DataFrame, so a call like df2.drop(...) has no effect unless its result is assigned

df2 = df.copy()
df2.tail()
NAME median_household_income share_unemployed_seasonal share_population_in_metro_areas share_population_with_high_school_degree share_non_citizen share_white_poverty gini_index share_non_white share_voters_voted_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
46 Virginia 66155 0.043 0.89 0.866 0.06 0.07 0.459 0.38 0.45 0.36 1.72
47 Washington 59068 0.052 0.86 0.897 0.08 0.09 0.441 0.31 0.38 0.67 3.81
48 West Virginia 39552 0.073 0.55 0.828 0.01 0.14 0.451 0.07 0.69 0.32 2.03
49 Wisconsin 58080 0.043 0.69 0.898 0.03 0.09 0.430 0.22 0.48 0.22 1.12
50 Wyoming 55690 0.040 0.31 0.918 0.02 0.09 0.423 0.15 0.70 0.00 0.26
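Removing a row by position, as with `df.index[11]` earlier, depends on the current row order: once one row is gone, the same position points to a different state. Filtering on the name avoids this. The sketch below illustrates both points on a three-row toy table with invented values.

```python
import pandas as pd

toy = pd.DataFrame({
    "NAME": ["Georgia", "Hawaii", "Idaho"],
    "value": [1, 2, 3],
})

# drop() returns a new DataFrame; without assignment nothing changes
unchanged = toy.copy()
unchanged.drop(unchanged.index[1])
print(len(unchanged))  # still three rows

# Filtering by name does not depend on row positions
by_name = toy[toy["NAME"] != "Hawaii"]
print(by_name["NAME"].tolist())

# Repeating a positional drop now removes a different state (Idaho)
dropped_again = by_name.drop(by_name.index[1])
print(dropped_again["NAME"].tolist())
```

On the lab data, `df = df[df.NAME != 'Hawaii']` is safe to run more than once, whereas repeating `df = df.drop(df.index[11])` would remove one more state each time.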
import scipy.stats
#instead of running it one by one for every pair of variables, like:
#scipy.stats.pearsonr(df2.median_household_income.values, df2.avg_hatecrimes_per_100k_fbi.values)
#we compute the full correlation matrix in one call

corrMatrix = df2.corr(numeric_only=True).round(2)
print (corrMatrix)
                                          median_household_income  \
median_household_income                                      1.00   
share_unemployed_seasonal                                   -0.34   
share_population_in_metro_areas                              0.29   
share_population_with_high_school_degree                     0.64   
share_non_citizen                                            0.28   
share_white_poverty                                         -0.82   
gini_index                                                  -0.15   
share_non_white                                             -0.00   
share_voters_voted_trump                                    -0.57   
hate_crimes_per_100k_splc                                    0.33   
avg_hatecrimes_per_100k_fbi                                  0.32   

                                          share_unemployed_seasonal  \
median_household_income                                       -0.34   
share_unemployed_seasonal                                      1.00   
share_population_in_metro_areas                                0.37   
share_population_with_high_school_degree                      -0.61   
share_non_citizen                                              0.31   
share_white_poverty                                            0.19   
gini_index                                                     0.53   
share_non_white                                                0.59   
share_voters_voted_trump                                      -0.21   
hate_crimes_per_100k_splc                                      0.18   
avg_hatecrimes_per_100k_fbi                                    0.07   

                                          share_population_in_metro_areas  \
median_household_income                                              0.29   
share_unemployed_seasonal                                            0.37   
share_population_in_metro_areas                                      1.00   
share_population_with_high_school_degree                            -0.27   
share_non_citizen                                                    0.75   
share_white_poverty                                                 -0.39   
gini_index                                                           0.52   
share_non_white                                                      0.60   
share_voters_voted_trump                                            -0.58   
hate_crimes_per_100k_splc                                            0.26   
avg_hatecrimes_per_100k_fbi                                          0.21   

                                          share_population_with_high_school_degree  \
median_household_income                                                       0.64   
share_unemployed_seasonal                                                    -0.61   
share_population_in_metro_areas                                              -0.27   
share_population_with_high_school_degree                                      1.00   
share_non_citizen                                                            -0.30   
share_white_poverty                                                          -0.48   
gini_index                                                                   -0.58   
share_non_white                                                              -0.56   
share_voters_voted_trump                                                     -0.13   
hate_crimes_per_100k_splc                                                     0.21   
avg_hatecrimes_per_100k_fbi                                                   0.16   

                                          share_non_citizen  \
median_household_income                                0.28   
share_unemployed_seasonal                              0.31   
share_population_in_metro_areas                        0.75   
share_population_with_high_school_degree              -0.30   
share_non_citizen                                      1.00   
share_white_poverty                                   -0.38   
gini_index                                             0.51   
share_non_white                                        0.76   
share_voters_voted_trump                              -0.62   
hate_crimes_per_100k_splc                              0.28   
avg_hatecrimes_per_100k_fbi                            0.30   

                                          share_white_poverty  gini_index  \
median_household_income                                 -0.82       -0.15   
share_unemployed_seasonal                                0.19        0.53   
share_population_in_metro_areas                         -0.39        0.52   
share_population_with_high_school_degree                -0.48       -0.58   
share_non_citizen                                       -0.38        0.51   
share_white_poverty                                      1.00        0.01   
gini_index                                               0.01        1.00   
share_non_white                                         -0.24        0.59   
share_voters_voted_trump                                 0.54       -0.46   
hate_crimes_per_100k_splc                               -0.26        0.38   
avg_hatecrimes_per_100k_fbi                             -0.26        0.42   

                                          share_non_white  \
median_household_income                             -0.00   
share_unemployed_seasonal                            0.59   
share_population_in_metro_areas                      0.60   
share_population_with_high_school_degree            -0.56   
share_non_citizen                                    0.76   
share_white_poverty                                 -0.24   
gini_index                                           0.59   
share_non_white                                      1.00   
share_voters_voted_trump                            -0.44   
hate_crimes_per_100k_splc                            0.12   
avg_hatecrimes_per_100k_fbi                          0.08   

                                          share_voters_voted_trump  \
median_household_income                                      -0.57   
share_unemployed_seasonal                                    -0.21   
share_population_in_metro_areas                              -0.58   
share_population_with_high_school_degree                     -0.13   
share_non_citizen                                            -0.62   
share_white_poverty                                           0.54   
gini_index                                                   -0.46   
share_non_white                                              -0.44   
share_voters_voted_trump                                      1.00   
hate_crimes_per_100k_splc                                    -0.69   
avg_hatecrimes_per_100k_fbi                                  -0.50   

                                          hate_crimes_per_100k_splc  \
median_household_income                                        0.33   
share_unemployed_seasonal                                      0.18   
share_population_in_metro_areas                                0.26   
share_population_with_high_school_degree                       0.21   
share_non_citizen                                              0.28   
share_white_poverty                                           -0.26   
gini_index                                                     0.38   
share_non_white                                                0.12   
share_voters_voted_trump                                      -0.69   
hate_crimes_per_100k_splc                                      1.00   
avg_hatecrimes_per_100k_fbi                                    0.68   

                                          avg_hatecrimes_per_100k_fbi  
median_household_income                                          0.32  
share_unemployed_seasonal                                        0.07  
share_population_in_metro_areas                                  0.21  
share_population_with_high_school_degree                         0.16  
share_non_citizen                                                0.30  
share_white_poverty                                             -0.26  
gini_index                                                       0.42  
share_non_white                                                  0.08  
share_voters_voted_trump                                        -0.50  
hate_crimes_per_100k_splc                                        0.68  
avg_hatecrimes_per_100k_fbi                                      1.00  
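The printed matrix wraps across many blocks, which makes it hard to read. When only one outcome matters, a single column of the matrix can be pulled out and sorted. The sketch below shows the idea on the five rows printed by `df.tail()` at the start of this lab (Virginia to Wyoming) and only a few columns; with five rows the correlations themselves are not meaningful, only the mechanics are.

```python
import pandas as pd

# Five states from df.tail(), selected columns only
sample = pd.DataFrame({
    "median_household_income": [66155, 59068, 39552, 58080, 55690],
    "gini_index": [0.459, 0.441, 0.451, 0.430, 0.423],
    "share_voters_voted_trump": [0.45, 0.38, 0.69, 0.48, 0.70],
    "avg_hatecrimes_per_100k_fbi": [1.72, 3.81, 2.03, 1.12, 0.26],
})

target = "avg_hatecrimes_per_100k_fbi"

# Full correlation matrix, then keep only the column for the outcome
corr = sample.corr(numeric_only=True)

# Drop the outcome's correlation with itself and sort the rest
with_target = corr[target].drop(target).sort_values()
print(with_target.round(2))
```

On the lab data the same pattern would be `corrMatrix['avg_hatecrimes_per_100k_fbi'].sort_values()`.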
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#round to one decimal so the annotations stay readable given the number of variables
corrMatrix = df2.corr(numeric_only=True).round(1)
sns.heatmap(corrMatrix, annot=True)
plt.show()

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics

#predictors and outcome (unstandardized values, since df2 is a copy of df)
x = df2[['median_household_income', 'share_population_with_high_school_degree', 'share_voters_voted_trump']]
y = df2[['avg_hatecrimes_per_100k_fbi']]
#what if we change the y variable
#y = df2[['hate_crimes_per_100k_splc']]

#fit an ordinary least squares model with an intercept
est = LinearRegression(fit_intercept=True)
est.fit(x, y)

print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)

#in-sample predictions and fit metrics from the same fitted model
y_hat = est.predict(x)
print("MSE:", metrics.mean_squared_error(y, y_hat))
print("R^2:", metrics.r2_score(y, y_hat))
print("var:", y.var())
Coefficients: [[-1.63935828e-05  7.65352737e+00 -7.85302986e+00]]
Intercept: [0.49461694]
MSE: 2.1105276140605045
R^2: 0.26736253642536767
var: avg_hatecrimes_per_100k_fbi    2.939516
dtype: float64
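The three numbers printed above are linked: R^2 equals 1 minus the MSE divided by the population variance of y. Since `y.var()` in pandas uses the sample variance (ddof=1), it has to be rescaled by (n - 1) / n first. The sketch below checks this with the printed values; n = 50 is the number of rows left after dropping Hawaii.

```python
n = 50                       # rows used in the regression
mse = 2.1105276140605045     # printed MSE
sample_var = 2.939516        # printed y.var(), computed with ddof=1

# Convert to the population variance (ddof=0)
population_var = sample_var * (n - 1) / n

# Share of the variance of y that the model accounts for
r2 = 1 - mse / population_var
print(round(r2, 4))
```

This reproduces the reported R^2 of about 0.267, meaning the three predictors account for roughly a quarter of the in-sample variation in the FBI hate crime rate.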